Main goals of the session

  1. Retrieve sequences from databases and build multiple sequence alignments.
  2. Estimate species divergence at different functional gene regions
  3. Calculate and compare evolutionary rates of two proteins with different functions.

1. Practicals organization

To put into practice the concepts of species divergence and evolutionary rates that you will learn in the theory lessons, you will analyse nucleotide divergence in two genes (obp83b and rpL32) with different biological functions in five closely related species of the genus Drosophila.

Throughout the document you will see different icons whose meaning is:

: Additional or useful information

: Practical exercise

: Hint to solve an exercise or to do a task

: Slot to answer a question

: Problem or task to be solved


2. Installing R packages

You can use either the R console in the terminal or RStudio for this practice. If you don’t have R installed, you can download the appropriate package for your system here. To install RStudio, go to this page and follow the instructions.

Before starting the exercises, you will need to install some necessary libraries for downloading sequences from the GenBank database, performing multiple sequence alignments, building phylogenetic trees, and calibrating these trees. Open the R console in the terminal (or in RStudio) and type:

packages <- c("reutils", "ape", "seqinr", "phangorn", "phylotools")
#install.packages(setdiff(packages, rownames(installed.packages())))
#install.packages(packages)

3. Retrieving sequence data

You can use the efetch() function implemented in the reutils package to retrieve sequences from databases. These functions allow you to connect to GenBank and download the sequences directly to your computer in FASTA format. You will need the GenBank identifier of either the gene region or the complete genome sequence of the species and the coordinates of the desired gene region in that genome. Note that for some species the gene is encoded in the reverse complement sequence relative to the reference in the database. In these cases you will need to specify the strand in efetch. In the example below

Let’s see how to retrieve the coding regions (CDS) of the obp83b gene in different species. Table 1 lists the identifiers and coordinates of this gene region in the five species:


Table 1. Identifiers and genomic coordinates of the obp83b gene in different Drosophila species.

Species ID Coordinates in the genome Strand
D. melanogaster NT_033777 5976177-5976817 2
D. simulans NC_052523 2470507-2471257 2
D. erecta NW_020825194 26500029-26500718 1
D. elegans NW_024545863 12746083-12746802 2
D. pseudopobscura NC_046679 16336491-16337172 1



GenBank_ identifiers can be obtained from the literature (articles) by starting a search with R libraries such as rentrez, directly from the web page of this database, using esearch() from reutils, or using other programmes and more advanced bioinformatics tools. .

To download and write a file in FASTA format with the CDS sequence of the obp83b gene in these species, you can use the efetch() function of reutils.

library(reutils)
library(ape)

## create a new working directory
dir.create("divergence")
Warning in dir.create("divergence"): 'divergence' already exists
## example for D. pseudopobscura:
efetch("NC_046679", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=16336491, seqstop=16337172, outfile="./divergence/Dpse_obp83b_cds.fasta")
#Get D.melanogaster
efetch("NT_033777", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=5976177, seqstop=5976817, outfile="./divergence/Dmel_obp83b_cds.fasta")

#Get D.simulans
efetch("NC_052523", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=2470507, seqstop=2471257, outfile="./divergence/Dsim_obp83b_cds.fasta")

#Get D.erecta
efetch("NW_020825194", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=26500029, seqstop=26500718, outfile="./divergence/Dere_obp83b_cds.fasta")

#Get D.elegans
efetch("NW_024545863", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=12746083, seqstop=12746802, outfile="./divergence/Dele_obp83b_cds.fasta")

#Command line execution (otherwise use read.dna and write.dna)
#cat D*_obp83b_cds.fasta > obp83b_cds.fasta
EntryName<-function(folder, name){
    folder<-paste0("./", folder, "/")
    f<-paste0(folder, name,"_cds.fasta")
    sequence<-read.dna(f, format="fasta")
    seqnc<-updateLabel(sequence, labels(sequence), name)
    write.dna(seqnc, format="fasta", file=paste0(folder, "obp83b_cds.fasta"), append=TRUE)
}
new_names<-c("Dmel_obp83b", 
             "Dsim_obp83b", 
             "Dere_obp83b", 
             "Dele_obp83b", 
             "Dpse_obp83b")
for (name in new_names){
 EntryName("divergence", name) 
}

db, specifies the database; rettype, specifies the subsequence to be extracted (=CDS) and the format (=FASTA); retmode, specifies the file format (=plain text); strand, 1: positive strand, 2: negative (reverse complement with respect to the reference in the database); seqstart and seqstop, are the coordinates of the sequence to be downloaded (required if we are extracting gene regions from the complete genome sequence set).


Exercise 1

  1. Download the rest of CDS sequences of the obp83b gene in the “divergence” folder and name the files as in the example (starting with a three-letter code identifying the species).
  2. Create a new file (e.g. “obp83b_cds.fasta”) with the CDS of the five species in fasta format (multi-fasta file).
    You can do this manually or, for example, use the read.dna() and write.dna() functions in the ape package.

4. Multiple sequence alignment

To identify the nucleotide substitutions that have occurred in the coding region of the opb83b gene during the divergence of the four species, one must first predict which positions are homologous (i.e. derived from a common ancestor). There are many algorithms and programs for constructing multiple sequence alignments of DNA sequences. For purely practical reasons, in this session you will use ape, an R package that implements some of these algorithms.

library(ape)

## read the sequences
myseqs<-read.dna("divergence/obp83b_cds.fasta", format="fasta")

## visualize the sequences
myseqs
5 DNA sequences in binary format stored in a list.

Mean sequence length: 427.8 
   Shortest sequence: 426 
    Longest sequence: 429 

Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b

Base composition:
    a     c     g     t 
0.252 0.253 0.271 0.224 
(Total: 2.14 kb)
## multiple sequence alignment
myalign <- clustal(myseqs)

## partial visualization of the generated alignment
myalign
5 DNA sequences in binary format stored in a matrix.

All sequences of same length: 429 

Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b

Base composition:
    a     c     g     t 
0.252 0.253 0.271 0.224 
(Total: 2.14 kb)
## write a file (maintaining the gaps introduced in the msa step)
write.dna(myalign, file="divergence/obp83b_cds_aln.fasta", format="fasta")

5. Divergence at synonymous and nonsynonymous sites

Coding regions are special for molecular evolutionary analyses because they contain two different types of positions in terms of the functional consequences of mutations, namely synonymous and nonsynonymous positions. Synonymous changes are those that do not alter the amino acid encoded by the codon, whereas nonsynonymous changes cause an amino acid substitution in the protein. Consequently, synonymous sites are sites where if a change occurs, it will be a synonymous change. The same applies to nonsynonymous sites. Studying and comparing the variability of these two types of sites is very informative about selective constraints and the functional consequences of mutations.

Divergence is estimated as the number of substitutions per site (this allows the divergence of regions of different lengths to be compared). Therefore, to estimate synonymous and nonsynonymous divergence, we first need to calculate the number of sites of each type in the coding regions of interest. The most commonly used method to estimate these sites is the Nei-Gojobori (N-G) method (Figure 1). The N-G approach consists of estimating the proportion of changes that would be synonymous and nonsynonymous in a given codon after considering all nine possible nucleotide changes (based on the genetic code).


Figure 1. Nei-Gojobori method for estimating the number of synonymous and nonsynonymous sites in a codon


Here are some examples of functions for working with changes and synonymous sites:

library(seqinr)

Attaching package: 'seqinr'
The following objects are masked from 'package:ape':

    as.alignment, consensus
translate<-seqinr::translate

## function to translate a codon:
dna_to_aa <- function(codon) {
 dna<-s2c(codon)
 aa<-translate(dna, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
 return (aa)
}  

## function to determine if a change between two codons is synonym
is_synonymous<-function(codon1, codon2) {
  return (dna_to_aa(codon1) == dna_to_aa(codon2))
}

## function to estimate the number of synonymous sites in a codon:
synpos<-function(codon) {
pos<-s2c(codon)
for (i in 1:length(pos)) {
   base = pos[i]
   BASES = c('A', 'G', 'T', 'C')
   other_bases = BASES[BASES!=base]
   syn=0
     for (new_base in other_bases) {
          new_pos=c(pos[0:(i-1)], new_base, pos[-(1:i)])
          new_codon=paste(new_pos, collapse="")
        s<-(is_synonymous(codon, new_codon))
          syn <- syn + length(s[s==TRUE])
     }
   synp <- syn/3
}
return (synp)
}

Exercise 2

Calculate the number of synonymous and nonsynonymous changes and sites in this coding sequence alignment using the functions above:

Seq 1 ATGATGCAGAGTCTGTAA
Seq 2 ATGAGGCACAGTCTGTAA

The number of sites of each class in each sequence is the sum over all codons in the sequence, and the total number of sites of each class in the alignment is the average over all sequences. You can apply the functions separately for each codon, or consider a loop integrating functions such as s2c() and splitseq() from the seqinr package

seq1<-"ATGATGCAGAGTCTGTAA"
seq2<-"ATGAGGCACAGTCTGTAA"

aa1<-dna_to_aa(seq1)
aa2<-dna_to_aa(seq2)

syn_list<-is_synonymous(seq1,seq2)
syn<-sum(syn_list)
nonsyn<-length(syn_list)-syn


changeCounter<-function(seq){
  i=3
  pre=0
  seq_Syn=0
  
  while (i<=nchar(seq)){
    s<-substr(seq, pre, i)
    
    seq_Syn=seq_Syn+synpos(s)
    pre=i+1
    i=i+3
  }
  seq_NSyn=nchar(seq)-seq_Syn
  #syn=paste0("Synonymous: ", seq_Syn)
  #nsyn=paste0("Non-synonymous: ", seq_NSyn)
  return(c(seq_Syn, seq_NSyn))
}

seq1_data<-changeCounter(seq1)
seq2_data<-changeCounter(seq2)

syn_sites<-(seq1_data[1]+seq2_data[1])/2
nsyn_sites<-(seq1_data[2]+seq2_data[2])/2
Seq1:
    Synonymous changes: 2 
    Non-synonymous changes: 16

Seq2:
    Synonymous changes: 2.333333 
    Non-synonymous changes: 15.66667


Synonymous sites: 2.166667 
Non-synonymous sites: 15.83333

Fortunately, you will not need to repeat this calculation for all the gene sequences we will be studying in this lab. There are R functions that can be used to calculate the synonymous and nonsynonymous divergence of a coding region directly from the CDS alignment. The seqinr package implements the KaKs() function, which allows the calculation of these divergences. Before proceeding with the calculations, we will change the sequence names in the fasta file to make it easier to identify the species in the results matrix (now the identifiers are very long and some of them do not contain the species name):

library(phylotools)

Attaching package: 'phylotools'
The following object is masked from 'package:seqinr':

    read.fasta
old_names<-get.fasta.name("divergence/obp83b_cds_aln.fasta")
## new names must be in the same order than old names...
new_names<-c("Dmel_obp83b", 
             "Dsim_obp83b", 
             "Dere_obp83b", 
             "Dele_obp83b", 
             "Dpse_obp83b")
ref2 <- data.frame(old_names, new_names)
rename.fasta(infile = "divergence/obp83b_cds_aln.fasta", ref_table = ref2, outfile = "divergence/obp83b_cds_renamed_aln.fasta")
divergence/obp83b_cds_renamed_aln.fasta has been saved to  /home/jj/Desktop/Bioinformatics/2nd_year/3term/Population_Genetics_ME/Seminars/P4_Estimating_nucleotide_divergences+evolutionary_rates 
cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")

## matrix with synonymous divergences (Ks)
kaks(cds)$ks
            Dele_obp83b Dere_obp83b Dmel_obp83b Dpse_obp83b
Dere_obp83b   0.6182696                                    
Dmel_obp83b   0.5494927   0.3234201                        
Dpse_obp83b   1.0414168   1.4340955   0.9158393            
Dsim_obp83b   0.7103626   0.3617048   0.1256942   0.9289957
## matrix with nonsynonymous divergences (Ka)
kaks(cds)$ka
            Dele_obp83b Dere_obp83b Dmel_obp83b Dpse_obp83b
Dere_obp83b  0.03576413                                    
Dmel_obp83b  0.03749267  0.03931737                        
Dpse_obp83b  0.07112614  0.08614269  0.07829961            
Dsim_obp83b  0.02529429  0.02555604  0.01992201  0.07511411

6. Divergence at noncoding sites

In addition to the CDS, noncoding sequences are also found in eukaryotic genes. These sequences correspond to introns, exonic untranslated regions (5’ or 3’) or 5’ proximal regions, often containing promoters and regulatory elements. Here, you will analyse the divergence in the 5’ proximal region and the introns of the obp83b gene in the same five species.

First, you will obtain noncoding sequences of this gene for all species. To do this, you need to know exon coordinates. Again, you can use reutils functions to read the GenBank entries in your R terminal:

## example for D. pseudoobscura:
x<-efetch("NC_046679", db="nucleotide", rettype="gb", retmode="text", strand=1, seqstart=16336491, seqstop=16337172)
con <- content(x, as = "textConnection")
readLines(con)
 [1] "LOCUS       NC_046679                682 bp    DNA     linear   CON 14-APR-2020" 
 [2] "DEFINITION  Drosophila pseudoobscura strain MV2-25 chromosome 2, UCI_Dpse_MV25," 
 [3] "            whole genome shotgun sequence."                                      
 [4] "ACCESSION   NC_046679 REGION: 16336491..16337172"                                
 [5] "VERSION     NC_046679.1"                                                         
 [6] "DBLINK      BioProject: PRJNA622252"                                             
 [7] "            BioSample: SAMN13616452"                                             
 [8] "            Assembly: GCF_009870125.1"                                           
 [9] "KEYWORDS    WGS; RefSeq."                                                        
[10] "SOURCE      Drosophila pseudoobscura"                                            
[11] "  ORGANISM  Drosophila pseudoobscura"                                            
[12] "            Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta;"       
[13] "            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;"            
[14] "            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila; Sophophora."    
[15] "COMMENT     REFSEQ INFORMATION: The reference sequence is identical to"          
[16] "            CM020868.1."                                                         
[17] "            Assembly name: UCI_Dpse_MV25"                                        
[18] "            The genomic sequence for this RefSeq record is from the"             
[19] "            whole-genome assembly released by the University of California,"     
[20] "            Irvine on 2020/01/15. The original whole-genome shotgun project has" 
[21] "            the accession WVEN00000000.1."                                       
[22] "            "                                                                    
[23] "            ##Genome-Assembly-Data-START##"                                      
[24] "            Assembly Provider      :: University of California, Irvine"          
[25] "            Assembly Method        :: Canu v. 1.7"                               
[26] "            Assembly Name          :: UCI_Dpse_MV25"                             
[27] "            Genome Representation  :: Full"                                      
[28] "            Expected Final Version :: Yes"                                       
[29] "            Genome Coverage        :: 280.0x"                                    
[30] "            Sequencing Technology  :: PacBio Sequel"                             
[31] "            ##Genome-Assembly-Data-END##"                                        
[32] "            "                                                                    
[33] "            ##Genome-Annotation-Data-START##"                                    
[34] "            Annotation Provider         :: NCBI"                                 
[35] "            Annotation Status           :: Full annotation"                      
[36] "            Annotation Name             :: Drosophila pseudoobscura Annotation"  
[37] "                                           Release 104"                          
[38] "            Annotation Version          :: 104"                                  
[39] "            Annotation Pipeline         :: NCBI eukaryotic genome annotation"    
[40] "                                           pipeline"                             
[41] "            Annotation Software Version :: 8.4"                                  
[42] "            Annotation Method           :: Best-placed RefSeq; Gnomon"           
[43] "            Features Annotated          :: Gene; mRNA; CDS; ncRNA"               
[44] "            ##Genome-Annotation-Data-END##"                                      
[45] "FEATURES             Location/Qualifiers"                                        
[46] "     source          1..682"                                                     
[47] "                     /organism=\"Drosophila pseudoobscura\""                     
[48] "                     /mol_type=\"genomic DNA\""                                  
[49] "                     /strain=\"MV2-25\""                                         
[50] "                     /db_xref=\"taxon:7237\""                                    
[51] "                     /chromosome=\"2\""                                          
[52] "                     /sex=\"female\""                                            
[53] "                     /tissue_type=\"whole body\""                                
[54] "                     /dev_stage=\"adult\""                                       
[55] "     gene            1..682"                                                     
[56] "                     /gene=\"LOC4801921\""                                       
[57] "                     /note=\"Derived by automated computational analysis using"  
[58] "                     gene prediction method: Gnomon.\""                          
[59] "                     /db_xref=\"GeneID:4801921\""                                
[60] "     mRNA            join(1..128,186..261,317..682)"                             
[61] "                     /gene=\"LOC4801921\""                                       
[62] "                     /product=\"pheromone-binding protein-related protein 6\""   
[63] "                     /note=\"Derived by automated computational analysis using"  
[64] "                     gene prediction method: Gnomon. Supporting evidence"        
[65] "                     includes similarity to: 108 Proteins, and 100% coverage of" 
[66] "                     the annotated genomic feature by RNAseq alignments,"        
[67] "                     including 102 samples with support for all annotated"       
[68] "                     introns\""                                                  
[69] "                     /transcript_id=\"XM_001358902.3\""                          
[70] "                     /db_xref=\"GeneID:4801921\""                                
[71] "     CDS             join(54..128,186..261,317..594)"                            
[72] "                     /gene=\"LOC4801921\""                                       
[73] "                     /note=\"Derived by automated computational analysis using"  
[74] "                     gene prediction method: Gnomon.\""                          
[75] "                     /codon_start=1"                                             
[76] "                     /product=\"pheromone-binding protein-related protein 6\""   
[77] "                     /protein_id=\"XP_001358939.1\""                             
[78] "                     /db_xref=\"GeneID:4801921\""                                
[79] "                     /translation=\"MMKSVGLLMLLLGCVAAQGPRRDAEYPPPAILKLAQHFHDICAP"
[80] "                     KTGVTDAAIKEFSDGEIHEDENLKCYMNCLFHEFEVVDDNGDVHMEKLFNAVPSEKLR" 
[81] "                     NVLMEASKDCIHPEGDTLCHKAWWFHQCWKKADPVHYFLV\""                 
[82] "ORIGIN      "                                                                    
[83] "        1 ccagtccttt gccctcagtc cgctcagtgt ttttgaaaac ttgttgcctg aaaatgatga"     
[84] "       61 aatctgttgg tcttctcatg ctgctacttg gctgtgtagc tgcccaggga cccagacggg"     
[85] "      121 atgctgaggt ttgtagtgga ttcatgacac ttgtaccatt tatcagttat tatctctcgc"     
[86] "      181 cgcagtatcc accgccagca attctaaaat tggcccaaca cttccacgat atttgtgccc"     
[87] "      241 caaaaactgg tgtcactgat ggtaggtcct actctttacc attttttcga ttactcttca"     
[88] "      301 atttcggtct ctttagcggc catcaaggag ttcagcgacg gagagataca tgaggacgag"     
[89] "      361 aatctcaagt gctacatgaa ctgtctcttc cacgagttcg aagtggtcga tgataatggc"     
[90] "      421 gatgtacata tggagaagct attcaatgcc gttccctcgg aaaagctgcg caacgtattg"     
[91] "      481 atggaggcct ccaaggattg catccatccg gagggcgaca ccttgtgcca caaggcctgg"     
[92] "      541 tggttccatc aatgctggaa gaaagctgat cctgtccact atttcttggt ctaagatgta"     
[93] "      601 aatgtctttg caatatttat gtttatgtga atgtgtgttt atattaagct taataaatgg"     
[94] "      661 ttaaaggttc atttaaacat ta"                                              
[95] "//"                                                                              
[96] ""                                                                                
[97] ""                                                                                
#Get D.melanogaster
y<-efetch("NT_033777", db="nucleotide", rettype="gb", retmode="text", strand=2, seqstart=5976177, seqstop=5976817)
con <- content(y, as = "textConnection")
#readLines(con)

#Get D.simulans
z<-efetch("NC_052523", db="nucleotide", rettype="gb", retmode="text", strand=2, seqstart=2470507, seqstop=2471257)
con <- content(z, as = "textConnection")
#readLines(con)

#Get D.erecta
a<-efetch("NW_020825194", db="nucleotide", rettype="gb", retmode="text", strand=1, seqstart=26500029, seqstop=26500718)
con <- content(a, as = "textConnection")
#readLines(con)

#Get D.elegans
b<-efetch("NW_024545863", db="nucleotide", rettype="gb", retmode="text", strand=2, seqstart=12746083, seqstop=12746802)
con <- content(b, as = "textConnection")
#readLines(con)

The readLines() function prints the entry in GenBank format. Locate the feature “CDS” to find the coordinates of the coding region of the gene obp83b in this species.

Once you know the coordinates of CDS, you can extract 5’ and intron sequences:

    ## example for D.pseudobscura
    ## download the complete gene region
    efetch("NC_046679", db="nucleotide", rettype="fasta", retmode="text", strand=1, seqstart=16336491,      seqstop=16337172, outfile="divergence/Dpse_obp83b.fasta")

    ## create an object with the sequence of the complete gene region
    sequence<-read.dna("divergence/Dpse_obp83b.fasta", format="fasta")

    ## create a new sequence with only intron sequences
    ## CDS = join(54..128,186..261,317..594)
    ## 5' region = 1-53
    ## intron 1 = 129-185
    ## intron2 = 262-316
    seqnc<-sequence
    seqnc=seqnc[,c(1:53,129:185,262:316)]
    
    ## change species label
    seqnc<-updateLabel(seqnc, labels(seqnc), "Dpse_obp83b")

    ## write a fasta file with noncoding sequences of the opb83b gene of ALL SPECIES (with append=TRUE, if your run this script for all species, all sequences will be added to the same file; remember to change the ids in each case)
    write.dna(seqnc, format="fasta", file="divergence/obp83b_cds.fasta", append=TRUE)

Now we can align the non-coding sequences and estimate the number of substitutions per site (KNC) of the obp83b gene. For nucleotide divergences (without any distinction of positions, just total divergence) you can use the function dist.dna() from the package ape.

    ## read the noncoding sequences
    myseqs<-read.dna("divergence/obp83b_cds.fasta", format="fasta")

    ## visualize the sequences
    myseqs
6 DNA sequences in binary format stored in a list.

Mean sequence length: 384 
   Shortest sequence: 165 
    Longest sequence: 429 

Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b
Dpse_obp83b

Base composition:
    a     c     g     t 
0.247 0.252 0.263 0.237 
(Total: 2.3 kb)
    ## multiple sequence alignment
    myalign <- clustal(myseqs)

    ## partial visualization of the generated alignment
    myalign
6 DNA sequences in binary format stored in a matrix.

All sequences of same length: 430 

Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b
Dpse_obp83b

Base composition:
    a     c     g     t 
0.247 0.252 0.263 0.237 
(Total: 2.58 kb)
    ## write the alignment to file
    write.dna(myalign, file="divergence/obp83b_nc_aln.fasta", format="fasta")
    
    ## read the noncoding alignment
    nc <- read.dna("divergence/obp83b_nc_aln.fasta", format="fasta")

    ## noncoding divergences in the opb83b gene
    dist.dna(nc)
            Dmel_obp83b Dsim_obp83b Dere_obp83b Dele_obp83b Dpse_obp83b
Dsim_obp83b  0.06368852                                                
Dere_obp83b  0.12667685  0.09786105                                    
Dele_obp83b  0.11196991  0.12706643  0.13419579                        
Dpse_obp83b  0.14909196  0.15745580  0.22224915  0.13464234            
Dpse_obp83b  0.57679173  0.53319951  0.50684131  0.51517541  0.51007558

Exercise 3

Fill empty cells in the following tables with the results obtained in the different divergence analyses and answer the questions:

Table 1. Divergence in the coding region of the gene opb83b(Ka and Ks values are above and below the diagonal, respectively)

Dmel Dsim Dere Dele Dpse
Dmel ——— 0.01992201 0.03931737 0.03749267 0.07829961
Dsim 0.1256942 ——— 0.02555604 0.02529429 0.07511411
Dere 0.3234201 0.3617048 ——— 0.03576413 0.08614269
Dele 0.5494927 0.7103626 0.6182696 ——— 0.07112614
Dpse 0.9158393 0.9289957 1.4340955 1.0414168 ———

Divergence in the non-coding regions of the gene opb83b (Fill in the table with KNC values above the diagonal)

Dmel Dsim Dere Dele Dpse
Dmel ——— 0.06368852 0.12667685 0.11196991 0.14909196
Dsim ——— ——— 0.09786105 0.12706643 0.15745580
Dere ——— ——— ——— 0.13419579 0.22224915
Dele ——— ——— ——— ——— ———
Dpse ——— ——— ——— ——— ———

Questions:

1. Is the obp83b gene evolving rapidly? And the Obp83b protein?

Answer:

The average values for each species are as follows: Dmel: 0.47611, Dsim: 0.66702, Dere: 1.02618, and Dele: 1.041416. These values show a significant increase, suggesting a rapid evolution of genes.
For the protein, the averages for non-synonymous substitutions are: Dmel: 0.043757, Dsim: 0.041988, Dere: 0.060953, and Dele: 0.0711261. When we compare these to the synonymous substitution averages, we observe that the non-synonymous averages are significantly lower. This indicates negative selection, suggesting that the protein is also undergoing evolutionary changes.

2. In which of the analysed positions (synonymous, nonsynonymous or noncoding) are evolutionary distances higher? What is the reason for this, in your opinion?

Answer:

For synonymous positions. This may be because they don't affect the function of the gene, they have no functional constraints, so more of them can happen without impacting the organism negatively.

3. Could you consider some of the observed changes as selectively neutral? Which ones?

Answer:

The mutations in non-coding regions can be considered as selectively neutral changes because they don't affect the evolution of genes

4. Which species would share a more recent common ancestor? In which evolutionary distances (synonymous, nonsynonymous or noncoding) should we focus to know that?

Answer:

We have observe the non synonymous regions for the smallest difference, the smallest difference in divergence is Dsim with Dmel with a difference in divergence of 0.0018 aprox.

7. Phylogenetic tree reconstruction and tree calibration

Some of the most direct applications of nucleotide divergence estimates are the reconstruction of phylogenetic relationships and the inference of evolutionary rates. To reconstruct the phylogenetic relationships among the five Drosophila species using the information form the obp83b gene, you can use the functions from ape and phangorn packages.

As an example, the following code reconstructs the phylogenetic relationships between the five species based on synonymous divergences. It uses the synonymous divergence matrix estimated above and the Neighbor Joining algorithm

library(phangorn)

cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")

## NJ tree
syntree <- NJ(kaks(cds)$ka)
plot(syntree)

write.tree(syntree, file="divergence/obp83b_Ks.tree")

By definition, a Neighbor Joining tree is an unrooted tree. There is no root node that is the common ancestor of all species. With an unrooted tree, we can tell the phylogenetic relationships between species (i.e. which species have a more recent common ancestor compared to the other species), but not the direction of evolution (i.e. which species arose more recently). In addition, branch lengths represent substitutions per site, in this case the number of substitutions per site. Sometimes, as in our example, we know which branch contains the root of the species under study (see Figure 2).

Figure 2. Phylogenetic relationships among the species of the genus Drosophila(modified from Mol Ecol Resour. 2022;22:1559–1581)

To calculate the evolutionary rate of the Obp83b protein, we need an ultrametic tree, i.e. a rooted tree in which the branch lengths represent time (not substitutions) and at least one calibration point (i.e. a dated node).

## read the tree to a phylo class object
tree<-read.tree(file="divergence/obp83b_Ks.tree")

## plot the unroted tree
plot(tree, main="Obp gene tree (number of nonsynonymous substitutions per site)")

## root the tree using Drosophila pseudoobscura as the outgroup
rooted<-root(tree, "Dpse_obp83b")
plot(rooted)

## set calibration
## select the node corresponding to the ancestor of D. melanogater, D. simulans and D. erecta, and     specify a range of 6.1-18.9 MYA
mrca_node <- getMRCA(rooted, c("Dmel_obp83b", "Dsim_obp83b", "Dere_obp83b")) 
cal <- data.frame(
    node = mrca_node,
    age.min = 6.1,
    age.max = 18.9
)



## fit the model (penalized likelihood)
chr<-chronos(rooted, model="clock", calibration = cal)

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -0.5106124 
Optimising rates... dates... -0.5106124 

log-Lik = -0.5106124 
PHIIC = 9.02 
# plot de calibrated (ultrametric) tree
plot(chr, main="calibrated tree (million years)")
axisPhylo()
tiplabels()
nodelabels()

chr$edge
     [,1] [,2]
[1,]    6    8
[2,]    8    7
[3,]    7    1
[4,]    7    2
[5,]    8    5
[6,]    6    3
[7,]    6    4
chr$edge.length
[1] 10.390326  5.580626 13.319373 13.319373 18.899999 29.290325 29.290325

The tiplabels() and nodelabels() functions will tell you the id of each node in the tree (they are indicated in the tree). With chr$edge and chr$edge.length you can find the correspondence between edge numbers and edge lengths, and thus the estimated date in million years of each node.

You now have all the parameters you need to calculate the evolutionary rate of the Obp83b protein in Drosophila.

Exercise 4

Calculate the evolutionary rate (r) of the Obp83b protein in Drosophila using the divergences and times estimated in this practice. Think carefully about which divergence to use for this calculation and apply the formula.

kaks(cds)$ka
            Dele_obp83b Dere_obp83b Dmel_obp83b Dpse_obp83b
Dere_obp83b  0.03576413                                    
Dmel_obp83b  0.03749267  0.03931737                        
Dpse_obp83b  0.07112614  0.08614269  0.07829961            
Dsim_obp83b  0.02529429  0.02555604  0.01992201  0.07511411
divergence_times <- c(dele_dere = 21, dele_dpse = 18, dele_dsim = 22, dele_dmel = 22)
ks_values <- c(dele_dere = 0.03576413, dele_dpse = 0.07112614, dele_dsim = 0.02529429, dele_dmel = 0.03749267)
evolutionary_rates <- ks_values /2*divergence_times
mean_evolutionary_rate <- mean(evolutionary_rates)
mean_evolutionary_rate
[1] 0.4265788
cat("The evolutionary rate is", mean_evolutionary_rate)
The evolutionary rate is 0.4265788

Final assignment

Table 2. Identifiers and genomic coordinates of the rpL32 gene in different Drosophila species.

Species ID Coordinates in the genome Strand
D. melanogaster NT_033777 30045229-30046161 2
D. simulans NC_052523 26165102-26166121 2
D. erecta NW_020825194 2228933..2229932 1
D. serrata NW_018367383 6093667..6094458 2
D. pseudopobscura NC_046679 811678-812625 1


Using the data in Table 2, calculate the rate of evolution of the protein RpL32 in Drosophila and answer the following questions. Note that for this gene D. elegans is replaced by D. serrata.

dir.create("Final_ass")
Warning in dir.create("Final_ass"): 'Final_ass' already exists
#RUN ONLY THE FIRST TIME
#efetch("NT_033777", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart= 30045229, seqstop=30046161, outfile="./Final_ass/Dmel_cds.fasta")

#efetch("NC_052523", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=26165102, seqstop=26166121, outfile="./Final_ass/Dsim_cds.fasta")

#efetch("NW_020825194", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=2228933, seqstop=2229932, outfile="./Final_ass/Dere_cds.fasta")

#efetch("NW_018367383", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=6093667, seqstop=6094458, outfile="./Final_ass/Dser_cds.fasta")

#efetch("NC_046679", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=811678, seqstop=812625, outfile="./Final_ass/Dpse_cds.fasta")

#Delete entry repetitions in Dere_obp83b_cds.fasta, Dmel_obp83b_cds, Dsim_obp83b_cds

new_names<-c("Dmel", 
             "Dsim", 
             "Dere", 
             "Dser", 
             "Dpse")
for (name in new_names){
 EntryName("Final_ass", name)
}
system("mv Final_ass/obp83b_cds.fasta Final_ass/fa_cds.fasta")
## read the noncoding sequences
myseqs<-read.dna("Final_ass/fa_cds.fasta", format="fasta")

## multiple sequence alignment
myalign <- clustal(myseqs)

## write the alignment to file
write.dna(myalign, file="Final_ass/fa_aln.fasta", format="fasta")
a<-read.dna("Final_ass/fa_aln.fasta", format="fasta")

old_names<-get.fasta.name("Final_ass/fa_aln.fasta")
new_names<-c("Dmel", 
              "Dsim", 
              "Dere", 
              "Dser", 
              "Dpse")
ref2 <- data.frame(old_names, new_names)
rename.fasta(infile = "Final_ass/fa_aln.fasta", ref_table = ref2, outfile = "Final_ass/fa_renamed_aln.fasta")
Final_ass/fa_renamed_aln.fasta has been saved to  /home/jj/Desktop/Bioinformatics/2nd_year/3term/Population_Genetics_ME/Seminars/P4_Estimating_nucleotide_divergences+evolutionary_rates 
## read the noncoding alignment
cds <- read.alignment("Final_ass/fa_renamed_aln.fasta", format="fasta")

kaks(cds)$ks #synonymous
           Dere       Dmel       Dpse       Dser
Dmel 0.04469301                                 
Dpse 0.50454520 0.51472975                      
Dser 0.52481511 0.52849324 0.60968113           
Dsim 0.03733297 0.02182090 0.50377590 0.52545067
kaks(cds)$ka #Non-synomymous
           Dere       Dmel       Dpse       Dser
Dmel 0.00000000                                 
Dpse 0.01145412 0.01143935                      
Dser 0.01139894 0.01236601 0.01435741           
Dsim 0.00000000 0.00000000 0.01143935 0.01236601
library(phangorn)

cds <- read.alignment("Final_ass/fa_renamed_aln.fasta",format="fasta")

syntree <- NJ(kaks(cds)$ks)
write.tree(syntree, file="Final_ass/fa_Ks.tree")
plot(syntree)

tree<-read.tree(file="Final_ass/fa_Ks.tree")

rooted<-root(tree, "Dere") #We choose D.erecta as outgroup because it is not in any cluster

mrca_node <- getMRCA(rooted, c("Dser", "Dpse")) #I choose this two because they form a cluster

#I use the same distances that in the example
cal <- data.frame(
    node = mrca_node,
    age.min = 6.1,
    age.max = 18.9
)
  
    chr<-chronos(rooted, model="clock", calibration = cal)

Setting initial dates...
Fitting in progress... get a first set of estimates
         (Penalised) log-lik = -2.295721 
Optimising rates... dates... -2.295721 

log-Lik = -2.295721 
PHIIC = 12.59 
plot(chr, main="calibrated tree (million years)")
axisPhylo()
tiplabels()
nodelabels()

chr$edge
     [,1] [,2]
[1,]    6    7
[2,]    7    1
[3,]    7    2
[4,]    6    8
[5,]    8    3
[6,]    8    4
[7,]    6    5
chr$edge.length
[1] 10.838461  1.098351  1.098351  3.547680  8.389132  8.389132 11.936811

5. Which is the evolutionary rate of the RpL32 protein?

Answer:

divergence_times <- c(dmel_dsim = 6.1, dmel_dere = 66, dmel_dser = 66, dmel_dpse = 66)
ks_values <- c(dmel_dsim = 0.021182090, dmel_dere = 0.04469301, dmel_dser = 0.01236601, dmel_dpse = 0.01143935)
evolutionary_rates <- ks_values /2*divergence_times
mean_evolutionary_rate <- mean(evolutionary_rates)
mean_evolutionary_rate
[1] 0.5812629
The evolutionary rate is 0.5812629

6. Are the evolutionary rates of the two proteins studied in this practice different?

Answer:

Yes, the evolutionary rate of obp83b was 0.4265788 while this one is 0.5812629

7. Can you provide a molecular evolutionary explanation for the answer to question 6?

Answer:

Because of difference of presence of variables like functional constraints, selective pressures, genetic redundancy, environmental interactions, and protein essentiality which can account for the more rapid evolutionary rate of one of the 2 proteins

8. Which of the other positions examined here (i.e. nonsynonymous, noncoding, the full cds…) could have been used to reconstruct the phylogenetic tree prior to calibration?

Answer:

The CDS could have been used, particularly when combined with synonymous and noncoding regions to balance the evolutionary signals.

9. Looking at your results and the tree in Figure 2, do you think that the fact that some of the species are not the same in both analyses (D. elegans in obp83b and D. serrata in rpL32) influences the results? Discuss

Answer:

The fact that they are different species probably does affect the analysis, between species there may be differences like observed evolutionary rates, divergence times, selective pressures, and lineage-specific evolutionary dynamics, which affect the analysis. We chould use the same set of species for both genes for a better analysis.

Deliver info

Submit this document with your answers (or just a document with your answers to the questions in any readable format, e.g. word, pdf, plain text…), and the R code used to complete the final assignment to AULAESCI

Deadline: June 28, 2024

Doubts?